NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Generation of accurate, expandable phylogenomic trees with uDance

https://doi.org/10.1038/s41587-023-01868-8

Balaban, Metin; Jiang, Yueyu; Zhu, Qiyun; McDonald, Daniel; Knight, Rob; Mirarab, Siavash (May 2024, Nature Biotechnology)

Phylogenetic trees provide a framework for organizing evolutionary histories across the tree of life and aid downstream comparative analyses such as metagenomic identification. Methods that rely on single-marker genes such as 16S rRNA have produced trees of limited accuracy with hundreds of thousands of organisms, whereas methods that use genome-wide data are not scalable to large numbers of genomes. We introduce updating trees using divide-and-conquer (uDance), a method that enables updatable genome-wide inference using a divide-and-conquer strategy that refines different parts of the tree independently and can build off of existing trees, with high accuracy and scalability. With uDance, we infer a species tree of roughly 200,000 genomes using 387 marker genes, totaling 42.5 billion amino acid residues.
more » « less
Full Text Available
DEPP: Deep Learning Enables Extending Species Trees using Single Genes

https://doi.org/10.1093/sysbio/syac031

Jiang, Yueyu; Balaban, Metin; Zhu, Qiyun; Mirarab, Siavash; Solis-Lemus, ed., Claudia (April 2022, Systematic Biology)

Abstract Placing new sequences onto reference phylogenies is increasingly used for analyzing environmental samples, especially microbiomes. Existing placement methods assume that query sequences have evolved under specific models directly on the reference phylogeny. For example, they assume single-gene data (e.g., 16S rRNA amplicons) have evolved under the GTR model on a gene tree. Placement, however, often has a more ambitious goal: extending a (genome-wide) species tree given data from individual genes without knowing the evolutionary model. Addressing this challenging problem requires new directions. Here, we introduce Deep-learning Enabled Phylogenetic Placement (DEPP), an algorithm that learns to extend species trees using single genes without prespecified models. In simulations and on real data, we show that DEPP can match the accuracy of model-based methods without any prior knowledge of the model. We also show that DEPP can update the multilocus microbial tree-of-life with single genes with high accuracy. We further demonstrate that DEPP can combine 16S and metagenomic data onto a single tree, enabling community structure analyses that take advantage of both sources of data. [Deep learning; gene tree discordance; metagenomics; microbiome analyses; neural networks; phylogenetic placement.]
more » « less
Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model

https://doi.org/10.1093/bioadv/vbac055

Balaban, Metin; Bristy, Nishat Anjum; Faisal, Ahnaf; Bayzid, Md Shamsuzzoha; Mirarab, Siavash; Lengauer, ed., Thomas (August 2022, Bioinformatics Advances)

Abstract Summary: While alignment has been the dominant approach for determining homology prior to phylogenetic inference, alignment-free methods can simplify the analysis, especially when analyzing genome-wide data. Furthermore, alignment-free methods present the only option for emerging forms of data, such as genome skims, which do not permit assembly. Despite the appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy. One limitation of alignment-free methods is their reliance on simplified models of sequence evolution such as Jukes–Cantor. If we can estimate frequencies of base substitutions in an alignment-free setting, we can compute pairwise distances under more complex models. However, since the strand of DNA sequences is unknown for many forms of genome-wide data, which arguably present the best use case for alignment-free methods, the most complex models that one can use are the so-called no strand-bias models. We show how to calculate distances under a four-parameter no strand-bias model called TK4 without relying on alignments or assemblies. The main idea is to replace letters in the input sequences and recompute Jaccard indices between k-mer sets. However, on larger genomes, we also need to compute the number of k-mer mismatches after replacement due to random chance as opposed to homology. We show in simulation that alignment-free distances can be highly accurate when genomes evolve under the assumed models and study the accuracy on assembled and unassembled biological data. Availability and implementationOur software is available open source at https://github.com/nishatbristy007/NSB. Supplementary informationSupplementary data are available at Bioinformatics Advances online.
more » « less
Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT

https://doi.org/10.1371/journal.pcbi.1009449

Sarmashghi, Shahab; Balaban, Metin; Rachtman, Eleonora; Touri, Behrouz; Mirarab, Siavash; Bafna, Vineet (November 2021, PLOS Computational Biology)
Segata, Nicola (Ed.)
The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k -mers have shown the advantage of this approach in identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k -mer repeat spectrum. We show using a mix of theoretical and empirical analysis that there are fundamental limitations to estimating the k -mer spectra due to ill-conditioned systems, and that has implications for other genomic parameters. We get around this problem using a novel constrained optimization approach (Spline Linear Programming), where the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had 2.2% error in length estimation compared to 27% error previously achieved. In shotgun sequenced read samples with contaminants, RESPECT length estimates had median error 4%, in contrast to other methods that had median error 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. The RESPECT software will be publicly available at https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_shahab-2Dsarmashghi_RESPECT.git&d=DwIGAw&c=-35OiAkTchMrZOngvJPOeA&r=ZozViWvD1E8PorCkfwYKYQMVKFoEcqLFm4Tg49XnPcA&m=f-xS8GMHKckknkc7Xpp8FJYw_ltUwz5frOw1a5pJ81EpdTOK8xhbYmrN4ZxniM96&s=717o8hLR1JmHFpRPSWG6xdUQTikyUjicjkipjFsKG4w&e= .
more » « less
Full Text Available
Phylogenetic double placement of mixed samples

https://doi.org/10.1093/bioinformatics/btaa489

Balaban, Metin; Mirarab, Siavash (July 2020, Bioinformatics)

Abstract Motivation Consider a simple computational problem. The inputs are (i) the set of mixed reads generated from a sample that combines two organisms and (ii) separate sets of reads for several reference genomes of known origins. The goal is to find the two organisms that constitute the mixed sample. When constituents are absent from the reference set, we seek to phylogenetically position them with respect to the underlying tree of the reference species. This simple yet fundamental problem (which we call phylogenetic double-placement) has enjoyed surprisingly little attention in the literature. As genome skimming (low-pass sequencing of genomes at low coverage, precluding assembly) becomes more prevalent, this problem finds wide-ranging applications in areas as varied as biodiversity research, food production and provenance, and evolutionary reconstruction. Results We introduce a model that relates distances between a mixed sample and reference species to the distances between constituents and reference species. Our model is based on Jaccard indices computed between each sample represented as k-mer sets. The model, built on several assumptions and approximations, allows us to formalize the phylogenetic double-placement problem as a non-convex optimization problem that decomposes mixture distances and performs phylogenetic placement simultaneously. Using a variety of techniques, we are able to solve this optimization problem numerically. We test the resulting method, called MIxed Sample Analysis tool (MISA), on a varied set of simulated and biological datasets. Despite all the assumptions used, the method performs remarkably well in practice. Availability and implementation The software and data are available at https://github.com/balabanmetin/misa and https://github.com/balabanmetin/misa-data. Supplementary information Supplementary data are available at Bioinformatics online.
more » « less
Full Text Available
Fast and accurate distance‐based phylogenetic placement using divide and conquer

https://doi.org/10.1111/1755-0998.13527

Balaban, Metin; Jiang, Yueyu; Roush, Daniel; Zhu, Qiyun; Mirarab, Siavash (January 2021, Molecular Ecology Resources)

Full Text Available
Greengenes2 unifies microbial data in a single reference tree

https://doi.org/10.1038/s41587-023-01845-1

McDonald, Daniel; Jiang, Yueyu; Balaban, Metin; Cantrell, Kalen; Zhu, Qiyun; Gonzalez, Antonio; Morton, James T.; Nicolaou, Giorgia; Parks, Donovan H.; Karst, Søren M.; et al (July 2023, Nature Biotechnology)

Abstract Studies using 16S rRNA and shotgun metagenomics typically yield different results, usually attributed to PCR amplification biases. We introduce Greengenes2, a reference tree that unifies genomic and 16S rRNA databases in a consistent, integrated resource. By inserting sequences into a whole-genome phylogeny, we show that 16S rRNA and shotgun metagenomic data generated from the same samples agree in principal coordinates space, taxonomy and phenotype effect size when analyzed with the same tree.
more » « less
Complexity of avian evolution revealed by family-level genomes

https://doi.org/10.1038/s41586-024-07323-1

Stiller, Josefin; Feng, Shaohong; Chowdhury, Al-Aabid; Rivas-González, Iker; Duchêne, David A; Fang, Qi; Deng, Yuan; Kozlov, Alexey; Stamatakis, Alexandros; Claramunt, Santiago; et al (May 2024, Nature)

Abstract Despite tremendous efforts in the past decades, relationships among main avian lineages remain heavily debated without a clear resolution. Discrepancies have been attributed to diversity of species sampled, phylogenetic method and the choice of genomic regions^1–3. Here we address these issues by analysing the genomes of 363 bird species⁴(218 taxonomic families, 92% of total). Using intergenic regions and coalescent methods, we present a well-supported tree but also a marked degree of discordance. The tree confirms that Neoaves experienced rapid radiation at or near the Cretaceous–Palaeogene boundary. Sufficient loci rather than extensive taxon sampling were more effective in resolving difficult nodes. Remaining recalcitrant nodes involve species that are a challenge to model due to either extreme DNA composition, variable substitution rates, incomplete lineage sorting or complex evolutionary events such as ancient hybridization. Assessment of the effects of different genomic partitions showed high heterogeneity across the genome. We discovered sharp increases in effective population size, substitution rates and relative brain size following the Cretaceous–Palaeogene extinction event, supporting the hypothesis that emerging ecological opportunities catalysed the diversification of modern birds. The resulting phylogenetic estimate offers fresh insights into the rapid radiation of modern birds and provides a taxon-rich backbone tree for future comparative studies.
more » « less
Full Text Available
The impact of contaminants on the accuracy of genome skimming and the effectiveness of exclusion read filters

https://doi.org/10.1111/1755-0998.13135

Rachtman, Eleonora; Balaban, Metin; Bafna, Vineet; Mirarab, Siavash (May 2020, Molecular Ecology Resources)

Full Text Available
APPLES: Fast Distance-Based Phylogenetic Placement.

Balaban, Metin; Sarmashghi, Shahab; Mirarab, Siavash (May 2019, Lecture notes in computer science)

Full Text Available

« Prev Next »

Search for: All records